The term Support Vector Machine (SVM) is sometimes used loosely to refer to three methods, each an extension of the previous one[1]: the maximal margin classifier, the support vector classifier, and the support vector machine.
SVMs are a common supervised discriminative algorithm, well suited to complex small- to medium-sized datasets[2].
They can be used for both classification and regression.
To demonstrate SVMs I'll be using Fisher's (or Anderson's) Iris flower dataset[3].
The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
Figure 1: Iris Flowers
Five attributes were collected for the 150 records.
| | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
Figure 2: Iris Attributes
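The table above can be reproduced from scikit-learn's bundled copy of the dataset; a minimal sketch of loading it (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn
iris = load_iris()
X, y = iris.data, iris.target            # 150 samples x 4 numeric attributes
print(X.shape)                           # (150, 4)
print(iris.target_names.tolist())        # ['setosa', 'versicolor', 'virginica']
print(X[0])                              # first row, as in Figure 2
```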
In $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$.
In more detail[1]:
Two-dimensions: $\beta_0 + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 = 0$
Three-dimensions: $\beta_0 + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \beta_3\mathbf{x}_3 = 0$
$p$ dimensions: $\beta_0 + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \ldots + \beta_p\mathbf{x}_p = 0$
If $x = (\mathbf{x}_1, ..., \mathbf{x}_p)$ satisfies the equation above, then it is a point on the hyperplane.
If $\beta_0 + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \ldots + \beta_p\mathbf{x}_p > 0$, $x$ lies on one side of the hyperplane,
and if $\beta_0 + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \ldots + \beta_p\mathbf{x}_p < 0$, it lies on the other side.
We aim to classify an $n \times p$ matrix of $n$ observations in $p$-dimensional space, with these observations falling into two classes, $y_1,\ldots,y_n \in \{-1,1\}$.
If the hyperplane were to perfectly separate the classes, it would have the property that[1]:
$\beta_0 + \beta_1\mathbf{x}_{i1} + \beta_2\mathbf{x}_{i2} + ... + \beta_p\mathbf{x}_{ip} > 0 \quad \text{if} \ y_i = 1$,
$\beta_0 + \beta_1\mathbf{x}_{i1} + \beta_2\mathbf{x}_{i2} + ... + \beta_p\mathbf{x}_{ip} < 0 \quad \text{if} \ y_i = -1$.
For a new test observation $x^*$, we would look at the sign of:
$$f(x^*) = \beta_0 + \beta_1x_1^* + \beta_2x_2^* + \ldots + \beta_px_p^*.$$
We would assign it to class 1 if $f(x^*)$ is positive and to class $-1$ if it is negative.
Furthermore, we could use the magnitude of $f(x^*)$ to indicate how far the point lies from the hyperplane.
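A sketch of this decision rule (the coefficients below are made up for illustration, not fitted):

```python
import numpy as np

# Hypothetical hyperplane in 2-D: f(x) = beta0 + beta1*x1 + beta2*x2
beta0 = -1.0
beta = np.array([1.0, 1.0])

def classify(x_star):
    """Class from the sign of f(x*); |f(x*)| scales with distance from the plane."""
    f = beta0 + beta @ x_star
    return (1 if f > 0 else -1), abs(f)

print(classify(np.array([2.0, 2.0])))    # f = 3  -> class +1
print(classify(np.array([0.0, 0.0])))    # f = -1 -> class -1
```

Strictly, the perpendicular distance is $|f(x^*)|/\lVert\beta\rVert$; the raw magnitude only ranks points relative to one another.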
We need a reasonable way of constructing a hyperplane from the possible choices.
The maximal margin approach looks for the hyperplane that is furthest from the training observations: we compute the perpendicular distance from each training observation to a given separating hyperplane, and the smallest such distance is the margin.
The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
We hope the classifier with a large margin on the training data will generalise well to unseen test observations.
Another way of defining our hyperplane is:
$$\mathbf{w} \cdot \mathbf{x} + b = 0,$$
which uses the dot product between a weight vector, $\mathbf{w}$, and our input vector $\mathbf{x}$, plus our bias, $b$ (sometimes defined as $w_0$).
In statistics the dot product of two n-dimensional vectors $\mathbf{a}$ and $\mathbf{b}$ is often noted as $\mathbf{a} \cdot \mathbf{b}$. The dot product is a type of inner product, often denoted as $\left<\mathbf{a},\mathbf{b}\right>$.
In Machine Learning, vectors are typically represented as column vectors, so the dot product is defined as $\mathbf{a}^{\mathrm T} \mathbf{b}$. Using this notation, our hyperplane is[10]:
$$\mathbf{w}^{\mathrm T}\mathbf{x} + b = 0.$$
Our margins, where our labels $y \in \{-1,1\}$, are then:
$$ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \geq 1 \text{ if } y_i = 1, \\ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \leq -1 \text{ if } y_i = -1, \\ \text{for } i = 1,\ldots,n, $$
which can be more compactly written as:
$$ y_i\left(\mathbf{w}^{\mathrm T}\mathbf{x}_i + b\right) \geq 1 \quad \forall i $$
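A minimal sketch of these constraints in practice, assuming scikit-learn is available. The choice of the two petal features and the setosa/versicolor pair (which are linearly separable) is mine, and a very large `C` is used to approximate the hard margin:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target < 2                      # setosa vs. versicolor
X = iris.data[mask][:, 2:]                  # petal length, petal width
y = np.where(iris.target[mask] == 0, -1, 1)

# A very large C approximates the hard (maximal) margin
svm_clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = svm_clf.coef_[0], svm_clf.intercept_[0]

# Every training point should satisfy y_i (w^T x_i + b) >= 1 (up to tolerance)
margins = y * (X @ w + b)
print(margins.min())
```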
There are two equidistant points from the maximal margin hyperplane, lying on the dashed lines. These observations are called Support Vectors.
Calling these two points $x_1$ and $x_2$, we have:
$$\mathbf{w}^{\mathrm T}x_1 + b = 1,$$
$$\mathbf{w}^{\mathrm T}x_2 + b = -1.$$
Figure 3: Support Vectors
If we move our support vectors, our hyperplane will move too.
This is because the maximal margin hyperplane only depends on these support vectors.
Other data points could be moved without the hyperplane moving.
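This can be checked numerically. The sketch below (assuming scikit-learn; the petal features and setosa/versicolor classes are an arbitrary linearly separable choice) refits on the support vectors alone and recovers the same hyperplane:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target < 2                      # setosa vs. versicolor
X = iris.data[mask][:, 2:]                  # petal length, petal width
y = np.where(iris.target[mask] == 0, -1, 1)

svm_clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Keep only the support vectors and refit
keep = np.zeros(len(X), dtype=bool)
keep[svm_clf.support_] = True
svm_small = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])

# The hyperplane depends only on the support vectors, so it is unchanged
print(np.allclose(svm_clf.coef_, svm_small.coef_, atol=1e-3))
print(np.allclose(svm_clf.intercept_, svm_small.intercept_, atol=1e-3))
```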
We want to maximise the distance between the margin lines on which these points lie[10].
$$\mathbf{w}^{\mathrm T}x_1 + b = 1$$
$$\mathbf{w}^{\mathrm T}x_2 + b = -1$$
Subtracting the second equation from the first gives:
$$\mathbf{w}^{\mathrm T}(x_1 - x_2) = 2$$
We can normalise this equation by the length (Euclidean norm) of the vector $\mathbf{w}$, giving the margin width, $\frac{\mathbf{w}^{\mathrm T}(x_1 - x_2)}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||}$. We then want to
$\max_{\mathbf{w}, b}\frac{2}{||\mathbf{w}||}$, while classifying everything correctly: $y_i(\mathbf{w}^{\mathrm T}\mathbf{x}_i+b) \geq 1 \quad \forall i$.
Or, equivalently:
$\min_{\mathbf{w}, b}\frac{1}{2}||\mathbf{w}||^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^{\mathrm T}\mathbf{x}_i+b) \geq 1 \quad \forall i$,
which is easier to work with because it is a convex quadratic optimisation problem, efficiently solvable using quadratic programming (QP) algorithms.
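As a sketch of the QP, the primal can be handed to a general constrained solver. The toy data below is made up so the answer is easy to check by hand, and scipy's SLSQP stands in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up separable toy data in 2-D
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                       # v = (w1, w2, b)
    return 0.5 * v[:2] @ v[:2]          # (1/2) ||w||^2

# One inequality constraint per point: y_i (w^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)                             # by symmetry: w = (0.5, 0.5), b = 0
print(2 / np.linalg.norm(w))            # margin width 2/||w||
```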
This requires a Lagrangian formulation of the problem, so we introduce Lagrange multipliers $\alpha_i$, $i = 1, \ldots, n$:
$$ \min L_P = \frac{1}{2}||\mathbf{w}||^2 - \sum^n_{i=1} \alpha_iy_i(\mathbf{w}^{\mathrm T}\mathbf{x}_i+b) + \sum^n_{i=1} \alpha_i \qquad \text{s.t.} \quad \alpha_i \geq 0 \ \forall i $$
Setting the derivatives of $L_P$ with respect to $\mathbf{w}$ and $b$ to zero removes the dependence on $\mathbf{w}$ and $b$, giving the dual problem:
$$ \max L_D = \sum_i^n\alpha_i - \frac{1}{2}\sum_i^n\sum_k^n\alpha_i\alpha_ky_iy_k\mathbf{x}_i^{\mathrm T} \mathbf{x}_k \qquad \text{s.t.} \quad \alpha_i \geq 0 \ \forall i. $$
The derivative with respect to $b$ also yields the constraint
$$\sum^n_i\alpha_iy_i = 0.$$
Now we can find the vector $\hat\alpha$ that maximises $L_D$ using a QP solver.
Knowing our $\hat\alpha_i$ means we can find the weights, $\mathbf{\hat w}$, which are a linear combination of the training inputs, $\mathbf{x}_i$, training outputs, $y_i$, and the values of $\hat\alpha$:
$$\mathbf{\hat w} = \sum_{i \in s}\hat\alpha_iy_i\mathbf{x}_i,$$
where $s$ is the collection of indices of the support vectors[6]. We only need the support vectors, as they are the only points with $\hat\alpha_i > 0$.
Our bias term, $\hat b$, is then determined by[2]: $$ \hat b = \frac{1}{n_s}\sum_{i \in s}\left(y_i-\mathbf{\hat w}^{\mathrm T}\mathbf{x}_i\right), $$ where $n_s$ is the number of support vectors.
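scikit-learn exposes exactly these quantities: for a fitted `SVC`, `dual_coef_` holds $\hat\alpha_i y_i$ for each support vector, so both $\mathbf{\hat w}$ and $\hat b$ can be reconstructed by hand. A sketch (the petal-feature, setosa/versicolor setup is an arbitrary separable choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target < 2
X = iris.data[mask][:, 2:]                  # petal length, petal width
y = np.where(iris.target[mask] == 0, -1, 1)

svm_clf = SVC(kernel="linear", C=1e6).fit(X, y)

# w_hat = sum over support vectors of (alpha_i * y_i) * x_i
w_hat = (svm_clf.dual_coef_ @ svm_clf.support_vectors_).ravel()
print(np.allclose(w_hat, svm_clf.coef_.ravel()))

# b_hat = mean over support vectors of (y_i - w_hat^T x_i)
sv_y = y[svm_clf.support_]
b_hat = np.mean(sv_y - svm_clf.support_vectors_ @ w_hat)
print(np.isclose(b_hat, svm_clf.intercept_[0], atol=1e-2))
```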
Question: But why bother doing this? That was a lot of effort; why not just solve the original problem?
Answer: Because this lets us solve the problem by computing just the inner products ($\mathbf{x}_i^{\mathrm T} \mathbf{x}_k$), which will be important when we want to solve non-linearly separable classification problems.
Case 1: Two observations $\mathbf{x}_i$, $\mathbf{x}_k$ are completely dissimilar (orthogonal), so their inner product is 0. They don't contribute to $L_D$.
Case 2: Two observations $\mathbf{x}_i$, $\mathbf{x}_k$ are similar and predict the same output value (i.e. both $+1$ or both $-1$). This means $y_iy_k = 1$ and the value $\alpha_i\alpha_ky_iy_k\mathbf{x}_i^{\mathrm T}\mathbf{x}_k$ is positive. Because it is subtracted from the first sum, $\sum_i^n\alpha_i$, it decreases the value of $L_D$, so the algorithm downgrades similar feature vectors that make the same prediction.
Case 3: Two observations $\mathbf{x}_i$, $\mathbf{x}_k$ make opposite predictions about the output value (i.e. one $+1$ and the other $-1$), but are otherwise similar. The value $\alpha_i\alpha_ky_iy_k\mathbf{x}_i^{\mathrm T}\mathbf{x}_k$ is negative, and since we are subtracting it, it adds to the sum. These are the examples that maximise the margin width.
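A tiny numeric illustration of the three cases, using made-up vectors and unit multipliers so only the $y_iy_k\mathbf{x}_i^{\mathrm T}\mathbf{x}_k$ factor varies:

```python
import numpy as np

x_i = np.array([1.0, 0.0])
x_orth = np.array([0.0, 1.0])       # completely dissimilar to x_i
x_sim = np.array([0.9, 0.1])        # similar to x_i

# Case 1: orthogonal pair -> inner product 0, no contribution to L_D
print(x_i @ x_orth)                 # 0.0

# Case 2: similar pair, same labels (y_i = y_k = +1) -> positive term,
# which is subtracted in L_D and so lowers it
print((+1) * (+1) * (x_i @ x_sim))  # 0.9

# Case 3: similar pair, opposite labels -> negative term; subtracting it
# raises L_D, so these pairs shape the margin
print((+1) * (-1) * (x_i @ x_sim))  # -0.9
```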
The solution to the maximal margin classifier problem involves only the inner products of the observations:
$$ \left<x_i,x_{i^{\prime}}\right> = \sum^p_{j=1}x_{ij}x_{i^{\prime}j}. $$
Assume we have a new point $x$. If we wanted to compute $f(x)$ using our linear classifier, we would need the inner product between $x$ and each training point $x_i$:
$$f(x) = \sum^n_{i=1}\alpha_iy_i\left<x,x_i\right>+ b.$$
In the above case, to estimate the parameters $\alpha_1,\ldots,\alpha_n$ and $b$, we need the $n(n-1)/2$ inner products $\left<x_i,x_{i^\prime}\right>$ between all pairs of training observations.
However, $\alpha_i$ is nonzero only for the support vectors, so if we have the collection of their indices, $s$, we can instead compute:
$$ f(x) = \sum_{i\in s}\alpha_iy_i \left<x,x_i\right> + b. $$
In other cases, no exact linear separating hyperplane exists. We may then want a hyperplane that almost separates the two classes, allowing some errors, using a soft margin (the Support Vector Classifier).
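This dependence on inner products alone is why scikit-learn's `SVC` accepts `kernel="precomputed"`: both training and prediction can run off a matrix of inner products instead of raw features. A sketch showing the two routes agree (feature/class choices are arbitrary, as before):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target < 2
X = iris.data[mask][:, 2:]
y = np.where(iris.target[mask] == 0, -1, 1)

# Train once on raw features, once on the Gram matrix of inner products
lin = SVC(kernel="linear", C=1e6).fit(X, y)
pre = SVC(kernel="precomputed", C=1e6).fit(X @ X.T, y)

# For prediction, "precomputed" needs <x, x_i> between new and training points
same = bool((pre.predict(X @ X.T) == lin.predict(X)).all())
print(same)
```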
Now might be a good time to try exercises 1-4.